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Abstract 

Variational methods have been recently considered for scaling the training process of Gaussian pro¬ 
cess classifiers to large datasets. As an alternative, we describe here how to train these classifiers effi¬ 
ciently using expectation propagation. The proposed method allows for handling datasets with millions 
of data instances. More precisely, it can be used for (i) training in a distributed fashion where the data 
instances are sent to different nodes in which the required computations are carried out, and for (ii) max¬ 
imizing an estimate of the marginal likelihood using a stochastic approximation of the gradient. Several 
experiments indicate that the method described is competitive with the variational approach. 


1 Introduction 

Gaussian process classification is a popular framework that can be used to address supervised machine 
learning problems in which the task of interest is to predict the class label associated to a new instance given 
some observed data m. In the binary classification case, this task is typically modeled by considering a 
non-linear latent function / whose sign at each input location determines the corresponding class label. A 
practical difficulty is, however, that making inference about / in this setting is infeasible due to the non- 
Gaussianity of the likelihood. Nevertheless, very efficient methods can be used to carry out the required 
computations in an approximate way Eia. The result is a non-parametric classifier that becomes more 
expressive as the number of training instances increases. Unfortunately, the cost of all these methods is 
O(n^), where n is the number of instances. 

The computational cost described can be improved by using a sparse representation for the Gaussian 
process / a. A popular approach in this setting introduces additional data in the form ofm<^n inducing 
points, whose location is inferred during the training process by maximizing some function ElISl. This 
leads to a reduced training cost that scales like 0{nrti^). A limitation is, however, that the function to he 
maximized cannot be expressed as a sum over the data instances. This prevents using efficient techniques 
for maximization, such as stochastic gradient ascent or distributed computations. An exception is the 
work described in m, which combines ideas from stochastic variational inference O and from variational 
Gaussian processes (91 to provide a scalable method for Gaussian process classification that can be applied 
to datasets with millions of data instances. 

We introduce here an alternative to the variational approach described in |[7| that is based on expec¬ 
tation propagation (EP) oni. In particular, we show that in EP it is possible to compute a posterior ap¬ 
proximation for the Gaussian process / and to update the model hyper-parameters, including the inducing 
points, at the same time. Moreover, in our EP formulation the marginal likelihood estimate is expressed 
as a sum over the data. This enables using stochastic methods to maximize such an estimate to find the 
model hyper-parameters. The EP updates can also be implemented in a distributed fashion, by spiting the 
data across several computational nodes. Summing up, our EP formulation has the same advantages as 
the variational approach from Q, with the convenience that all computations are tractable and univariate 
quadrature methods are not required, which is not the case of (71. Einally, several experiments, involving 
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datasets with several millions of instances, show that both approaches for Gaussian process classification 
perform similarly in terms of prediction performance. 

2 Scalable Gaussian process classification 

We briefiy introduce Gaussian process classification and the model we use. Then, we show how expectation 
propagation (EP) can be used for training in a distributed fashion and how the model hyper-parameters can 
be inferred using a stochastic approximation of the gradient of the estimate of the marginal likelihood. See 
the supplementary material for full details on the proposed EP method. 

2.1 Gaussian process classification and sparse representations 

Assume some observed data in the form of a matrix of attributes X = (xi,..., with associated labels 
y = (^ 1 ,... where yi G { — 1,1}. The task is to predict the class label of a new instance. Eor 
this, it is assumed the labeling rule = sign(/(xi) + e^), where /(•) is a non-linear function and is 
standard Gaussian noise with probability density ^(6^10,1). Eurthermore, we assume a Gaussian process 
prior over / with zero mean and some covariance function /c(', •) HI. That is, / ^ ^7^(0, A:(', •)). To make 
inference about f = (/(xi),..., /(x^))^ given the observed labels y, Bayes’ rule can be used. Namely, 
p(f|y) = p(y|f)p(f)/p(y) where p(f) is a multivariate Gaussian distribution andp(y) can be maximized 
to find the parameters of the covariance function k. The likelihood of f is p(y |f) = IliLi where 

T>(') is the cdf of a standard Gaussian and fi = This is a non-Gaussian likelihood which makes 

the posterior intractable. However, there are techniques such as the Laplace approximation, expectation 
propagation or variational inference, that can be used to get a Gaussian approximation of p(f|y) EO. 
They all result in a non-parametric classifier. Unfortunately, these methods scale like O(n^), where n is 
the number of instances. 

Using a sparse representation for the Gaussian process / reduces the training cost. A popular method in¬ 
troduces a dataset of m <C n inducing points X = (xi,..., x^)^ with associated values f = (/(xi),..., 
/(x^))^ OO. The prior for / is then approximated as p(f) = /p(f |f)p(f |X)(if « / [lULi p(/i|f)] 
p{^\X.)d^ = pFiTc(f 1^)5 where the Gaussian conditional p(f |f) has been replaced by a factorized dis¬ 
tribution ]Xi=i p{fi\^) - This approximation is known as the full independent training conditional (EITC) 
n, and it leads to a prior pFiTc(f |X) with a low-rank covariance matrix. This prior allows for approxi¬ 
mate inference with a cost linear in n, i.e., 0{nnn?). Einally, the inducing points X can be seen as prior 
hyper-parameters to be learnt by maximizing p(y). 

2.2 Model specification and expectation propagation algorithm 

The first methods based on the EITC approximation do not express the estimate of p(y) as a sum across 
data instances Eia. This makes difficult the use of efficient algorithms for learning the model hyper¬ 
parameters. To avoid this, we follow m and do not marginalize the values f associated to the inducing 
points. Specifically, the posterior approximation is p(f |y) ^ /p(f |f)g(f)df, where g is a Gaussian dis¬ 
tribution that approximates p(f |y), i.e., the posterior of the values associated to the inducing points. To 
obtain q we use first the EITC approximation on the exact posterior: 

I N ^ _ /p(y|f)pnTc(^f)<ifp(f|x) ^ nr=i</’»(f)p(f|x) 

My|x) p(y|x) p(y|x) 

where p(y|f) = nr=i ■PFiTc (f|f) = nr=i^’(/i|f) = nr=i-V(/i|mj,Si) and</>j(f) = J 

= ^{yiTUil^Si + 1), withnii = andSi = Moreover, 

is a matrix with the prior covariances among the entries in f, is a row vector with the prior 
covariances between fi and f and Kj. j. is the prior variance of fi. 
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The r.h.s. of 0 is an intractable posterior due to the non-Gaussianity of each . We use expectation 
propagation (EP) to obtain a Gaussian approximation q Co). In EP each (j)i is approximated as: 

</>i(f) = $ Si 0i(f) = Si exp , (2) 

where is an un-normalized Gaussian factor, Vi = is a m dimensional vector, and Si, i>i and jli 

are parameters to be estimated by EP Importantly, has a one-rank precision matrix, which means that in 
practice only 0{m) parameters need to be stored per each 0^. This is not an approximation and the optimal 
approximate factor has this form (see the supplementary material). The posterior approximation q is 
obtained by replacing in the r.h.s. ofJTJ each exact factor with the corresponding approximate factor 
<j)i. Namely, g(f) = ]Xi=i ^i(^)p (^\^)/where Zq is a normalization constant that approximates the 
marginal likelihood p(y |X). All factors in q are Gaussian, including the prior. Thus, g is a multivariate 
Gaussian distribution over m dimensions. 

EP updates each iteratively until-convergence as follows. Eirst, is removed from q by com¬ 
puting (X q/^i. Then, we minimize the Kullback-Leibler divergence between and g, i.e., 

|g], with respect to g, where Zi is the normalization constant of This can be done by 

matching the mean and the covariances of which can be obtained from the derivatives of log Zi 

with respect to the (natural) parameters of g\\ Given an updated distribution g, the approximate factor is 
= Ziqjc^'^ . This guarantees that is similar to 4>i in regions of high posterior probability as estimated 
by g\^ dll. These updates are done in parallel for efficiency reasons, i.e., we compute g\* and the new g, 
for i = 1 ..., n, at the same time, and then update as before ifT^ . The new approximation g is obtained 
by multiplying all the (pi and p(f |X). The normalization constant of g, Zq, is the EP approximation of 
p(y|X). The log of this constant is 


logZ, =g(0) -g(0prior) + E]log^i log Zi = log Zi + - g{0), (3) 

i=l 

where 0, and ^pnor are the natural parameters of g, g\* and p(f |X), respectively; and g{6) is the 
log-normalizer of a multivariate Gaussian with natural parameters 6. See ca for further details. 

At convergence, the gradient of log Zq with respect to the parameters of any pi is zero ca. Thus, it 
is possible to evaluate the gradient of log Zq with respect to a hyper-parameter {i.e., a parameter of the 
covariance function /c or a component of X) (see the supplementary material). In particular, 

dlogZq jddpnot T ^^prior , ^OlogZi 

where rj and rjpnor are the expected sufficient statistics under g and the prior p(f |X), respectively. With 
these gradients we can easily estimate all the model hyper-parameters by maximizing log Zq. Moreover, it 
is also possible to use g to estimate the distribution of the label of a new instance x^: 


p(2 /*|y,X)?a yp(2 /*|/*)p(/*|f)g(f)dfc(f* . (5) 

Last, because several simplifications occur when computing the derivatives with respect to the inducing 
points C3, the running time of EP, including hyper-parameter optimization, is 0{nnn?). 

2.3 Scalable expectation propagation 

A drawback of EP is that the hyper-parameters of the model are updated via gradient ascent only after 
convergence, which is when 0 is valid. This is very inefficient at the initial iterations, in which the 
estimates of the model hyper-parameters are very poor, and EP may require several iterations to converge. 
We propose to update the approximate factors <pi and the model hyper-parameters Q at the same time. That 
is, after a parallel update of all the approximate factors, we update the hyper-parameters using gradient 
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ascent assuming that each is fixed. Because EP has not converged, the moments of and q 

need not match. Thus, extra terms must be added in Q to get the gradient. Nevertheless, our experiments 
show that the extra terms are very small and can be ignored. In practice, we use 0 for the inner update 
of the hyper-parameters. Figure (left) shows that this approach successfully maximizes log Zq on the 
Pima dataset from the UCI repository lITSll . Because we do not wait for convergence in EP, the method is 
significantly faster. The idea of why this works is as follows. The EP update of each can be seen as 
a (natural) gradient descent step on logZ^ when 0^, with j ^ i, remain fixed ifT^ . Furthermore, those 
updates are very effective for finding a stationary point of log Zq (see ifTOl for further details). Thus, it is 
natural that an inner update of the hyper-parameters when all remain fixed is an effective method for 
finding a maximum of log Zq. 


Pima 




Figure 1: (left) Value of log Zq obtained when updating the hyper-parameters after EP has converged (outer) and 
just after each update of the approximate factors (inner), when using the exact gradient, and the approximation (|^, 
which assumes matched moments between Z~^(j)iq^'^ and q. m — 300. (right) Distribution of the EP updates across 
K computational nodes storing a subset Pi,..., Vk of the data. Best seen in color. 

Distributed training: The method described is suitable for distributed computation using the ideas in 
fm . In particular, the training data can be split in K subsets Pi,..., Vk which are sent to K computa¬ 
tional nodes. A master node stores the posterior approximation q, which is sent to each computational node. 
Then, node k updates each with j G and returns master node. After each node has 

done this, the master node updates q using p(f |X) and the messages received. Because the gradient of the 
hyper-parameters 0 involves a sum over the data instances, its computation can also be distributed among 
the K computational nodes. Thus, the training cost of the EP method can be reduced to 0{nlKm?). 
Figure 0 (right) illustrates the scheme described. 

Training using minibatches: The method described is also suitable for stochastic optimization. In 
this case the data are split in minibatches Mu oi size at most m, the number of inducing points. For 
each minibatch Mk, each with j G Al/c is refined, and q is updated afterwards. Next, the model 
hyper-parameters are updated via gradient ascent using a stochastic approximation of 0 - Namely, 

dZq X ^^prior x ^^prior ^ ^ 

After the update, q is reconstructed. With this training scheme we allow for more frequent updates of 
the model hyper-parameters and the training cost scales like 0{inn?). The memory requirements scale, 
however, like 0{nm), since we need to store the parameters of each approximate factor 0^. 

3 Related work 

A related method for binary classification with GPs uses scalable variational inference (SVI) 171. Since 
p(y|f) = f p(y|f)p(f |f)(if, we obtain the bound logp(y|f) > IE^(f|f) [logp(y|f)] by taking the logarithm 

and using Jensen’s inequality. Let q{^) be a Gaussian approximation of p(f |y). Then, 

logp(y) = log J q{f)p{y\f)p{f\X)/q{f)df > Eg(y)[logp(y|f)] - KL[g(f)||p(f |X)], (7) 
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by Jensen’s inequality, with KL[' 11 •] a Kullback Leibler divergence. Using the first bound in (|^ gives 

logp(y) > IE^(f)[IEp(f|f)[logp(y|f)]] - KL[g(f)||p(f|X)] > Eg(f) [logp(y|f)] - KL[g(f)| |p(f |X)] 

> Er=iE,(/o[logp(^^|/^)] - KL[g(f)||p(f|X)] , (8) 

where g(f) = f p(f |f)g(f)df and q{fi) is the i-th marginal of g(f). Let g(f) = A/’(u|m, S), with m and 
S variational parameters. Because p(f|f) = A/’(f|Af, Kff — AKj^^), where A = and is a 

matrix with the covariances between pairs of observed inputs and inducing points, 


g(f) = Ar(f I Am, Kff + A(S - K^) A^). 


(9) 


S is encoded in practice as LL^ and the lower bound ^ is maximized with respect to m, L, the inducing 
points X and any hyper-parameter in the covariance function using either batch, stochastic or distributed 
optimization techniques. In the stochastic case, small minibatches are considered and the gradient of 
Sr=i subsampled and scaled accordingly to obtain an estimate of the exact 

gradient. In the distributed case, the gradient of the sum is computed in parallel. Once ^ has been 
optimized, ^ can be used for making predictions. The computational cost of this method is 0{nw?), 
when trained in a batch setting, and when using minibatches and stochastic gradients. A parctical 

disadvantage is, however, that [logp{yi\fi)] has no analyitic solution. Importantly, these expectations 

and their gradients must be approximated using quadrature techniques. By contrast, in the proposed EP 
method all computations have a closed-form solution. 

In the generalized FITC approximati on (G FITC) |[6l the values f associated to the inducing points 
are marginalized, as indicated in Section [2T| This generates the FITC prior pFiTc(f|X) which leads to 
a computational cost that is 0{nnn?). Expectation propagation (EP) is also used in such a model to ap¬ 
proximate p(f |y, X). In particular, EP replaces with a Gaussian factor each likelihood factor of the form 
p{.yi\fi) = ^{yifi)' A limitation is, however, that the estimate of p(y|X) provided by EP in this model 
does not contain a sum over the data instances. Thus, GFITC does not allow for stochastic nor distributed 
optimizaton of the model hyper-parameters, unlike the proposed approach. 

A similar model to one described in Section 2.2 is found in CD. There, EP is also used for approximate 
inference. However, the moments of the process are matched at f, instead of at f. This leads to equivalent, 
but more complicated EP udpates. Moreover, the inducing points (which are noisy) are not learned from 
the observed data, but kept fixed. This is a serious limitation. The main advantage with respect to GFITC 
is that training can be done in an online fashion, as in our proposed approach. 


4 Experiments 

We follow (71 and compare in several experiments the proposed method for Gaussian process classification 
based on a scalable EP algorithm (SEP) with (i) the generalized FITC approximation (GFITC) |0| and (ii) 
the scalable variational inference (SVI) method from Q. SEP and GFITC use faster parallel EP updates 
that require only matrix multiplications and avoid loops over the data instances (121. All methods are 
implemented in R. The code is found in the supplementary material. 

4.1 Performance on datasets from the UCI repository 

A first set of experiments evaluates the predictive performance of SEP, GFITC and SVI on 7 datasets 
extracted from the UCI repository ca. We use 90% of the data for training and 10% for testing and 
report averages over 20 repetitions of the experiments. All methods are trained using batch algorithms 
for 250 iterations. Both GFITC and SVI use L-BFGS-B. We report for each method the average negative 
test log-likelihood. A squared exponential covariance function with automatic relevance determination, an 
amplitude parameter and an additive noise parameter is employed. The initial inducing points are chosen at 
random from the training data, but are the same for each method. All other hyper-parameters are initialized 
to the same values. A different number of inducing points m are considered. Namely, 15%, 25% and 
50% of the total number of instances. The results of these experiments are displayed in TableThe best 
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performing method is highlighted in bold face. We observe that the proposed approach, i.e., SEP, obtains 
similar results to GFITC and SVI and sometimes is the best performing method. Table also reports the 
average training time of each method in seconds. The fastest method is SEP followed by SVI. GFITC 
is the slowest method since it runs EP until convergence to evaluate then the gradient of the approximate 
marginal likelihood. By contrast, SEP updates at the same time the approximate factors and the model 
hyper-parameters. 


Table 1: Average negative test log likelihood for each method and average training time in seconds. 



m = 15% 

m = 25% 

m — 50% 

Problem 

GFITC 

SEP 

SVI 

GFITC 

SEP 

SVI 

GFITC 

SEP 

SVI 

Australian 

.68 

± 

.06 

.69 ± .07 

.63 ± 

.05 

.68 ± 

.08 

.67 ± .07 

.63 ± 

.05 

.67 ± .09 

.64 ± .05 

.63 ± 

.05 

Breast 

.10 

± 

.05 

.11 ± .05 

.10 ± 

.05 

.11 ± 

.06 

.11 ± .05 

.10 ± 

.05 

.11 ± .05 

.11 ± .05 

.10 ± 

.05 

Crabs 

.07 

± 

.07 

.06 ± .06 

.07 ± 

.06 

.06 ± 

.07 

.06 ± .06 

.07 ± 

.07 

.06 ± .07 

.06 ± .06 

.09 ± 

.06 

Heart 

.43 

± 

.12 

.40 ± .13 

.39 ± 

.11 

.42 ± 

.12 

.41 ± .12 

.40 ± 

.11 

.42± .13 

.41 ± .11 

.40 ± 

.10 

Ionosphere 

.30 

± 

.22 

.26 ± .19 

.26 ± 

.14 

.29 ± 

.23 

.27 ± .20 

.27 ± 

.18 

.30 ± .24 

.27 ± .19 

.26 ± 

.16 

Pima 

.54 

± 

.08 

.52 ± .07 

.49 ± 

.05 

.53 ± 

.07 

.51 ± .06 

.50 ± 

.05 

.53 ± .07 

.50 ± .05 

.49 ± 

.05 

Sonar 

.35 

± 

.13 

.33 ± .10 

.40 ± 

.17 

.35 ± 

.12 

.32 ± .10 

.40 ± 

.19 

.35 ± .13 

.29 ± .09 

.35 ± 

.16 

Avg. Time 

59 

± 

4 

17 ± 1 

40 ± 

2 

133 ± 

6 

37 ± 2 

65 ± 

3 

494 ± 29 

130± 5 

195 ± 

10 


4.2 Learning the location of the inducing points 

Using the setting of the previous section, we focus on the two dimensional Banana dataset and analyze 
the location of the inducing points inferred by each method. We initialize the inducing points at random 
from the training set and progressively increase their number m from 4 to 128. Figure]^ shows the results 
obtained. For small values of m, i.e., m = 4 or m = 8, SEP and SVI provide very similar locations 
for the inducing points. The estimates provided by GFITC for m = 4 are different as a consequence of 
arriving to a sub-optimal local maximum of the estimate of the marginal likelihood. If the initial inducing 
points are chosen differently, GFITC gives the same solution as SEP and SVI. We also observe that SEP 
and SVI quickly provide {i.e., for m = 16) estimates of the decision boundaries that look similar to the 
ones obtained with larger values of m {i.e., m = 128). These results confirm that SEP is able to find good 
locations for the inducing points. Finally, we note that SVI seems to prefer placing the inducing points 
near the decision boundaries. This is certainly not the case of GFITC nor SEP, which seems to inherit this 
behavior from GFTIC. 

4.3 Performance as a function of time 

We profile each method and show the prediction performance on the Image datasets a function of the 
training time, for different numbers of inducing points m = 4, 50, 200. Again we use 90% of the data 
for training and 10% for testing. We report averages over 100 realizations of the experiments. The results 
obtained are displayed in Figure (left). We observe that the proposed method SEP provides the best 
performance at the lowest computational time. It is faster than GFITC because in SEP we update the 
posterior approximation q and the hyper-parameters at the same time. By contrast, GFITC waits until EP 
has converged to update the hyper-parameters. SVI also takes more time than SEP to obtain a similar 
level of performance. This is because SVI requires a few extra matrix multiplications with cost 0{nnn?) 
to evaluate the gradient of the hyper-parameters. Furthermore, the initial performance of SVI is worse 
than the one of GFITC and SEP. After one iteration, both GFITC and SEP have updated each approximate 
factor, leading to a good estimate of q, the posterior approximation, which is then used for hyper-parameter 
estimation. By contrast, SVI updates q using gradient descent which requires several iterations to get a 
good estimate of this distribution. Thus, at the beginning, SVI updates the model hyper-parameters when 
q is still a very bad approximation. 
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Figure 2: Effect of increasing the inducing points for SEP, SVI and GEITC. Each column shows a different number 
of inducing points from m = 4 to m = 128. Blue and red points represent training data from the banana dataset. 
Inducing points are black dots and decision boundaries are black lines. Best seen in color. 


4.4 Training in a distributed fashion 

We illustrate the utility of SEP and SVI to carry out distributed training (GEITC does not allow for this). 
We consider the MNIST dataset and use 60, 000 instances for training and 10, 000 instances for testing. 
The number of inducing points m is set equal to 200 in both SEP and SVI. The task is to discriminate odd 
from even digits, which is a highly non-linear problem. We distribute the data across an increasing number 
of nodes from 1 to 12 using a machine with 12 CPUs. The process of distributed training is simulated 
via the R package doMC, which allows to execute for loops in parallel with a few lines of code. In SVI 
we parallelize the computation of the terms (corresponding either to both the lower bound or the gradient) 
that depend on the training instances. In SEP we parallelize the updates of the approximate factors and the 
computation of the estimate of the gradient of hyper-parameters. Eigure (right) shows the training time 
in seconds of each method as a function of the number of nodes (CPUs) considered. We observe that using 
more than 1 nodes significantly reduces the training time of SEP and SVI, until 6 nodes are reached. After 
this, no improvements are observed, probably because process synchronization becomes a bottle-neck. The 
test error and the avg. neg. test log likelihood of SVI is 2.2% and 0.0655, respectively, while for SEP they 
are 2.7% and 0.0694. These values are the same independently of the number of nodes considered. 

4.5 Training using minibatches and stochastic gradients 

We evaluate the performance of SEP and SVI on the MNIST dataset considered before when the training 
process is implemented using minibatches of 200 instances. Each minibatch is used to update the posterior 
approximation q and to compute a stochastic approximation of the gradient of the hyper-parameters. Note 
that GEITC does not allow for this type of stochastic optimization. The learning rate employed for updating 
the hyper-parameters is computed using the Adadelta method in both SEP and SVI with p = 0.9 and 
e = 10-^ CSl. The number of inducing points is set equal to the minibatch size, i.e., 200. We report the 
performance on the test set (prediction error and average negative test log likelihood) as a function of the 
training time. We compare the results of these methods (stochastic) with the variants of SEP and SVI that 
use all data instances for the estimation of the gradient (batch). Eigure|^(top) shows the results obtained. 
We observe that stochastic methods (either SEP or SVI) obtain good results even before batch methods 
have completed a single hyper-parameter update. Eurthermore, the performance of the stochastic variants 
of SEP and SVI in terms of the test error or the average negative log likelihood with respect to the running 
time is very similar. 
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Figure 3: (left) Prediction performance of each method on the Image dataset as a function of the training time mea¬ 
sured in seconds (in a log^Q scale). Different numbers of inducing points are considered, i.e., m = 4, 50, 200. Best 
seen in color, (right) Average training time in seconds for SEP and SVI on the MNIST dataset as a function of the 
number of computational nodes employed in the process of distributed training. 


Our last experiments consider information about all commercial flights in the USA from January 2008 
to April 2008 (available at http://stat-computing.org/dataexpo/2009/). The task is the same as in (Tl. 
Namely, to predict whether a flight was delayed or not based on 8 attributes: age of the aircraft, dis¬ 
tance that needs to be covered, airtime, departure time, arrival time, day of the week, day of the month and 
month. After removing instances with missing values 2,127, 068 instances remain. From these, 10, 000 
are used for testing and the rest are used for training the stochastic variants of SVI and SEP (batch methods 
are infeasible in this dataset). We use a minibatch of size 200 and set m = 200 and compare results with 
a logistic regression classifier. The results obtained are displayed in Figure [^(bottom). We observe that 
both SEP and SVI outperform the linear model, which shows that the problem is non-linear. Eventually 
SEP and SVI provide similar performance results, probably because in this large dataset the posterior dis¬ 
tribution is very close to be Gaussian. However, SEP improves results more quickly. This supports that, at 
the beginning, the EP updates of SEP are more effective for estimating q than the gradient updates of SVI. 
More precisely, SVI is probably updating the hyper-parameters using a poor estimate of q during the first 
iterations. 


5 Conclusions 

We have shown that expectation propagation (EP) can be used for Gaussian process classification in large 
scale problems. Our scalable variant of EP (SEP) allows for (i) training in a distributed fashion in which 
the data are sent to different computational nodes and (ii) for updating the posterior approximation and the 
model hyper-parameters using minibatches and a stochastic approximation of the gradient of the estimate 
of the marginal likelihood. The proposed method, SEP, has been compared with other approaches from the 
literature such as the generalized EITC approximation (GEITC) and a scalable variational inference (SVI) 
method. Our results show that SEP outperforms GEITC in large datasets in which that method becomes 
infeasible. Eurthermore, SEP is competitive with SVI and in large datasets provides the same or even better 
performance at a similar computational cost. If small minibatches are used for training, the computational 
cost of SEP is where m is the number of inducing points. A disadvantage is, however, that the 

memory requirements are 0{nm), where n is the number of instances. Einally, SEP seems to provide better 
results than SVI at the beginning. This is probably due to a better estimation of the posterior approximation 
q by using the EP updates (free of any learning rate) than by the gradient steps employed by SVI. 
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Figure 4: (top) Average test error and and average negative test log likelihood for SEP and SVI as a function of training 
time on the MNIST dataset. We report results for the variants that use a minibatch size equal to 200 to approximate 
the gradients (stochastic) and for the variants that use all data instances for the gradient evaluation (batch), (bottom) 
Same results for the Airline delays dataset where batch methods are not feasible. The performance of a linear logistic 
regression classiher is also displayed. Best seen in color. 
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A Supplementary Material 

Here we give all the necessary details to implement the EP algorithm for the proposed method described in 
the main manuscript, i.e. SEP. In particular, we describe how to compute the EP posterior approximation 
from the product of all approximate factors and how to implement the EP updates to refine each approx¬ 
imate factor. We also give an intuitive idea about how to compute the EP approximation to the marginal 
likelihood and its gradients. Note that the updates described are very similar to the ones in 1^ . 
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A.l Reconstruction of the posterior approximation 

In this section we show how to obtain the posterior approximation as the normalized product of the approx¬ 
imate factors 0i(f) and the prior p(f |X). From the main manuscript, we know that these factors have the 
following form: 

^i(J) Vivjf+ ilif , (10) 

p(f|X)=V(f|0,Kff), (11) 

where Vi = and is a covariance matrix of size m x m with the prior covariance among 

the values associated to the inducing points X. Both the approximate factors and the prior are Gaus¬ 
sian, a family of distributions that is closed under product and division. The consequence is that q{^) = 
is also Gaussian. In particular, ^(f) = X). To obtain the parameters of q 

we can use the formulas given in the Appendix of (211. This gives, 

S= (Kjj1 + YAYt)“\ (12 ) 

^ = SYA (13) 

where A is a diagonal matrix with diagonal entries equal to T is a matrix whose i-th column is equal 
to Vi, and /i is a vector whose i-th component is equal to fli. These computations have a cost 0{nnn?), 
under the assumption that m n. Otherwise the cost is 0{m^). 

A.l Computation of the cavity distribution 

Before the update of each 0^, the first step is to compute the cavity distribution q"^^ (x q/^i. Because q and 
are Gaussians, so it is In particular, g^*(f) = A/’(f The parameters of q^'^ can also be 

obtained using the formulas given in the Appendix of ED. That is, 

= (S-i - = S + , (14) 

lj\i ^ V - P'iVi) = M - fli) , (15) 

where we have used the Woodhury matrix identity and that +i'iVivJ. These computations 

have a cost that is 0{ni?). 

A.3 Update of the approximate factors 

In this section we show how to find the approximate factors 0^. For that we consider that the corresponding 
cavity distribution has already been computed. From the main manuscript, we know that the exact 
factor to be approximated is: 

= J ^{yifiW{fi\mi,Si)dfi = $ , (16) 

where ^(•) is the c.d.f. of a standard Gaussian, rrii = ^K^^f and Si = Kj.j. — . We 

compute Zi, i.e., the normalization constant of (j)iq^'^, as follows: 

= h (;©r) 

where Oi = and = 1 + By using 

the equations given in the Appendix of ED it is possible to obtain the moments, i.e., the mean fi and the 
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( 18 ) 


covariances S of from the derivatives of log Zi with respect to the parameters of Namely, 






= s\* - 

ff tfi fit ff 


ff ^fi 
T ^ 

2 I 

at + 


hi 


(19) 


where 


M{yiai/y/hl\^, 1) Vi 

^{yidi/Vhl) hi ' 


( 20 ) 


These are very similar to the EP updates described in (201. 

Given the previous updates, it is possible to find the parameters of the corresponding approximate factor 
which is simply obtained as ^i = Ziq ^^^where is a Gaussian distribution with the mean and 
the covariances of We show here that the precision matrix of the approximate factor has a low 

rank form. Denote with to such matrix. Let also iiii be the precision matrix of ^i times the mean vector. 

Define Vi = . Then, by using the equations given in the Appendix of fl\\ we have that 


Vi = s-i-(s\*) ' = (s\') ^=VivJvi (21) 

rhj = S“^m - = (^ai + aiUi + aivjY^’^Viuf^ Vi = jliVi (22) 


where we have used the Woodbury matrix identity, the definition of ih and 5], and 

n -1 


ai + 






fti = ai + aii>i + aivjTi'^^ViUi . 


(23) 


Thus, we see that the approximate factor has the form described in ( p^ . 

Once we have the parameters of the approximate factor ^i, we can compute the value of Si in (10) 
which guarantees that the approximate factor integrates the same as the exact factor with respect to g^*. Let 
6 be the natural parameters of g after the update. Similarly, let 6^'^ be the natural parameters of g\\ Then, 


Si = log Zi + g{e\^) - g{e ), 

where ^(0) is the log-normalizer of a multi-variate Gaussian with natural parameters 9). 


(24) 


A.4 Parallel EP updates and damping 


The updates described for the approximate factors are done in parallel. That is, we compute the required 
quantities to update each factor ^i at the same time using (^). Then, the new parameters of each ap¬ 
proximate factor Vi and fii are computed based on the previous ones. Linally, after the parallel update, we 
recompute g as indicated in Section [AT] All these operations have a closed-form and involve only matrix 
multiplications with cost O(nm^), where n is the number of samples and m is the number of inducing 
points. 

Parallel EP updates were first proposed in (2^ and have been also used in the context of Gaussian 
process classification in ca. Parallel EP updates are much faster than sequential updates because they 
avoid having to code loops over the training instances. All operations simply involve matrix multiplications 
which are significantly faster as a consequence of using the BLAS library (available in most scientific 
programming languages such as R, matlab or Python) that has been significantly optimized. 

Parallel updates may deteriorate EP convergence in some situations. Thus, we also use damped EP 
updates. Damping is a standard approach in EP algorithms which significantly improves convergence. The 
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idea is to avoid large changes in the parameters vi and jli of the approximate factors (pi. For this, the 
parameters after the EP updates are set to be a linear combination of the old and the new parameters. In 
particular, 


ui = pvr + (1 - , A* = wr + (1 - , (25) 

where p G [0,1] is a parameter controlling the amount of damping. If p = 1 there is no damping and if 
p = 0 the parameters of each pi are not updated at all. In our experiments we set p = 0.5 when doing batch 
training and we set p = 0.99 when the training process is done in a stochastic fashion using minibatches (in 
this case we do more frequent reconstructions of g, i.e., after processing each minibatch and less damping 
is needed). Damping does not change the fixed points of EP. 

A.5 Estimate of the marginal likelihood 

As indicated in the main manuscript, the estimate of the marginal likelihood is given by 

n 

log Zg = g{e) - ff (gprior) + log Zj log Zi = log Zi + g{e'^’‘) - g{d), (26) 


where 6, 0^'^ and ^pnor are the natural parameters of g, and p(f |X), respectively; and g{6) is the log- 
normalizer of a multivariate Gaussian distribution with natural parameters 6. Let m and S be the variance 
and the mean, respectively, of a Gaussian distribution over m dimensions with natural parameters O'. Then, 

9{d') = Ylog27r+ ^loglS"! + ^rn^S~^m. (27) 


The consequence is that 


11 1 ^ ^ 
log Zg = - log |S| + V - o log l^ff I + yi log 




(28) 


with 


= logZi + Log|S\*| + (S\*) - Log|S| - 

= logZi - 2ilivln + fiivjYiVi + Ci - 2ijJvivjYiViiiiCi 

+ PiCi i log(l - ViViHvi) , (29) 

where we have used that ^ — i>iVivJ, the Woodbury matrix identity, the matrix determinant 

lemma, that and set Ci = — ViT,Vi)~^. The consequence is that the 

computation of log Zq can be done with cost 0{nrin?) if m <C n. 

A.6 Gradient of log Zq after convergence 

In this section we show that the gradient of log Zq, after convergence, is given by the expression given in 
the main manuscript. For that, we extend the results of ca. Denote by to one hyper-parameter of the 
model. That is, a parameter of the covariance function /c or a component of the inducing points. Then, the 
gradient of log Zq with respect to this parameter is: 
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( 30 ) 










where 6, 0^'^ and Opnor are the natural parameters of q, and the prior p(f |X), respectively. Impor¬ 
tantly, the term logZ^ depends on in a direct way, i.e., because the exact likelihood factor 0i(f) = 
/ ^{yifi)N'{fi\mi,Si)dfi = withm^ = and = K/,/. - 

depends on ^j, and in an indirect way, i.e., because the natural parameters of the cavity distribution q^\ 
6^^, depend on ^j. In particular. 


= y* 0i(f)exp|(^0\*^ h{^) - g{6^^)'^ d^ , 
where h{f) are the sufficient statistics of The consequence is that 


Only 0i(f) changes 

d\ogZi _ abJiT /dlog Zi^ dd\* 


Only 0i(f) changes 



(31) 


(32) 


where rj and are the expected sufficient statistics under the posterior approximation q and the cavity 
distribution q'^'^. Recall that we have assumed convergence which leads to a match of the moments between 

Z^^(f)iq'^'^ and q. 

If we substitute ( [3^ in ( [30| ) we have that: 


d log Zj ^ / dg{e) \^ ^ _ ( ^g(^prior) 

d^j \ de J d^j V ^^prior 

J d9\^ 


da 


prior 


E 




i=l 

ae 


-V (r?prior 


d^j 

T 9^1 


f dg(0\^ 

^ I aev 

i=l ^ 


aiogZj ^ j de\^ 

de\^ ^/dg{a)Yde 
96 ) a^^ 


a^j 


prior 


a^j 


E 


a log Zi 


i=l ^ i=l 


i=l 

T ae\ 


a^j 


9i 


E^' 

i=l 


T-ae 

^’’w, 


dij 


dO T S^prior 

= ^ ^ - (^prior) 


_ T ( 

Wj~ 

T ( \ 

='^wr 

ae 

= v‘ 


9^3 

T ddt 


E 

i=l 


i=l 

a log Zi j 

i=l 


9^3 


ae\^ _ ae\ 

9ij d^j) 


prior 


9^j 

T dd, 


prior 


9^3 


E 

i=l 

n 

E 


aiogz, 


E iddi 
Q^\prior 


d log Zi j 


9i, acj 


- iVprior) 


i=l 

T 96, 


prior 


E 


acj 

a log Zi 


= V 


T^^piior / s,T ^^prior 

-(^prior) —^ 


9^3 


9ij 


E 

i=l 


9^^ ,=i 

aiogZi 


9^3 


9S.3 


(33) 


where rjpnor are the expected sufficient statistics of the prior and we have used that 6 = ^pnor + 
with 6i the natural parameters of the approximate factor ^i, and that Thus, at conver¬ 

gence the approximate factors can be considered to be fixed. In particular, ( [^ is the gradient obtained 
under the assumption that all remain fixed and do not change with the model hyper-parameters. 
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The chain rule of derivatives has to be taken with care in the previous expression. Since the natural 
parameters and the expected sufficient statistics are often expressed in the form of matrices, the chain rule 
for matrix derivatives has to be employed in practice (see (23] Sec. 2.8.1]). The consequence is that 


T ^^prior / x 


T 96 ^ 


pnor 


d^j 


-0.5tr 



(34) 


where 


M = • 

ff ff ff ff ff 


(35) 


In the case of computing the derivatives with respect to the inducing points several contractions occur, as 
indicated in Ha. The computational cost of obtaining these derivatives is 

The derivatives with respect to each log Zi can be computed also efficiently using the chain rule for 
matrix derivatives indicated in (23 Sec. 2.8.1]. The computational cost of obtaining these derivatives is 
0{nnn?). Furthermore, several standard properties of the trace can be employed to simplify the computa¬ 
tions. In particular, the trace is invariant to cyclic rotations. Namely, tr(ABCD) = tr(DABC). 

By using the gradients described, it is possible to maximize log Zq to find good values for the model 
hyper-parameters. However, as stated in the main manuscript, we do not wait until EP converges for doing 
the update. In particular, we perform an update of the hyper-parameters considering the as fixed, after 
each parallel refinement of the approximate factors. Because we are updating the approximate factors too, 
we cannot simply expect that such steps always improve on the objective log Zq, but in practice they seem 
to work very well. In our experiments we use an adaptive learning rate that is different for each hyper¬ 
parameter. In particular, we increase the learning rate by 2% if the sign of the estimate of the gradient 
for that hyper-parameter does not change between two consecutive iterations. If a change is observed, we 
reduce we multiply the learning rate by 1/2. If an stochastic approximation of the estimate of the gradient 
is employed, we use the ADADELTA method to estimate the learning rate (H. 


A.7 Predictive distribution 

Once the training process is complete, we can use the posterior approximation q for making predictions 
about the class label G { — 1,1} of a new instance x^. In that case, we compute first an approximate 
posterior for the Gaussian process evaluated at the target location, i.e., /(x^), which is summarized as /^: 

p(/*|y,X)« jp{f^\{)q{f)df 

« A/'(/*|to*,s*) , (36) 


where irii^ — fj, and - 1 - 5jK— and 

contain the prior variance of and the prior covariances between and f, respectively. The approximate 
predictive distribution for the class label is simply: 


p(y*|y,X) = y'p(2/*|/*)p(/*|y,X)d/* = J = <^> 


f \ 


where 4>(') is the c.d.f of a standard Gaussian distribution. 


(37) 
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